Circular sequence comparison: algorithms and applications
Background: Sequence comparison is a fundamental step in many important tasks in bioinformatics, from phylogenetic reconstruction to the reconstruction of genomes. Traditional algorithms for measuring approximation in sequence comparison are based on the notions of distance or similarity, and are generally computed through sequence alignment techniques. Circular molecular structure is a common phenomenon in nature, but adapting alignment techniques to circular sequence comparison is computationally expensive, requiring super-quadratic to cubic time in the length of the sequences. Results: In this paper, we introduce a new distance measure based on q-grams, and show how it can be applied effectively and computed efficiently for circular sequence comparison. Experimental results, using real DNA, RNA, and protein sequences as well as synthetic data, demonstrate orders-of-magnitude superiority of our approach in terms of efficiency, while maintaining accuracy that is very competitive with the state of the art.
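To make the idea of a q-gram distance on circular sequences concrete, here is a minimal sketch. It is not the measure introduced in the paper, only the plain L1 distance between q-gram count profiles, with each sequence read circularly so that q-grams may wrap around the end. The function names `circular_qgram_profile` and `circular_qgram_distance` are hypothetical and chosen for illustration.

```python
from collections import Counter


def circular_qgram_profile(s: str, q: int) -> Counter:
    """Count every length-q substring of s, reading s as a circular string."""
    if not 1 <= q <= len(s):
        raise ValueError("q must satisfy 1 <= q <= len(s)")
    # Appending the first q-1 characters lets windows wrap around the end.
    wrapped = s + s[:q - 1]
    return Counter(wrapped[i:i + q] for i in range(len(s)))


def circular_qgram_distance(x: str, y: str, q: int) -> int:
    """L1 distance between the circular q-gram profiles of x and y.

    Illustrative assumption: this is the textbook q-gram distance applied
    to circular profiles, not the distance defined in the paper.
    """
    px = circular_qgram_profile(x, q)
    py = circular_qgram_profile(y, q)
    return sum(abs(px[g] - py[g]) for g in px.keys() | py.keys())


# "GTACAC" is "ACGTAC" rotated by two positions.
print(circular_qgram_distance("ACGTAC", "GTACAC", 2))
print(circular_qgram_distance("AAAA", "AAAT", 2))
```

Because the circular profile does not change under rotation, any two rotations of the same sequence are at distance 0 without an explicit search over rotations. The converse does not hold: two sequences that are not rotations of each other can share the same profile.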
Asymptotically Optimal Encodings of Range Data Structures for Selection and Top-k Queries
Given an array A[1, n] of elements with a total order, we consider the problem of building a
data structure that solves two queries: (a) selection queries receive a range [i, j] and an integer
k and return the position of the kth largest element in A[i, j]; (b) top-k queries receive [i, j] and
k and return the positions of the k largest elements in A[i, j]. These problems can be solved in
optimal time, O(1 + lg k / lg lg n) and O(k), respectively, using linear-space data structures.
We provide the first study of the encoding data structures for the above problems, where A
cannot be accessed at query time. Several applications are interested in the relative order of the
entries of A, and their positions, rather than their actual values, and thus we do not need to keep A
at query time. In those cases, encodings save storage space: we first show that any encoding
answering such queries requires n lg k − O(n + k lg k) bits of space; then, we design encodings
using O(n lg k) bits, that is, asymptotically optimal up to constant factors, while preserving
optimal query time.
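The two query types defined above can be pinned down with a naive reference implementation. This is only a sketch of the query semantics: it sorts the range on every call and reads A directly, so it is neither an encoding nor the optimal-time structure described in the abstract. The names `range_select` and `range_top_k` are hypothetical, positions are 1-based to match A[1, n], and breaking ties toward the leftmost position is an assumption the abstract does not specify.

```python
def _positions_by_value(A, i, j):
    """1-based positions in [i, j], ordered from largest to smallest value."""
    if not 1 <= i <= j <= len(A):
        raise ValueError("need 1 <= i <= j <= len(A)")
    # sorted() is stable even with reverse=True, so equal values keep
    # left-to-right order (assumed tie-breaking rule).
    return sorted(range(i, j + 1), key=lambda p: A[p - 1], reverse=True)


def range_select(A, i, j, k):
    """Position of the k-th largest element of A[i..j] (inclusive)."""
    positions = _positions_by_value(A, i, j)
    if not 1 <= k <= len(positions):
        raise ValueError("need 1 <= k <= j - i + 1")
    return positions[k - 1]


def range_top_k(A, i, j, k):
    """Positions of the k largest elements of A[i..j], largest first."""
    positions = _positions_by_value(A, i, j)
    if not 1 <= k <= len(positions):
        raise ValueError("need 1 <= k <= j - i + 1")
    return positions[:k]


A = [3, 1, 4, 1, 5, 9, 2, 6]
print(range_select(A, 2, 6, 2))  # position of the 2nd largest in A[2..6]
print(range_top_k(A, 2, 6, 3))   # positions of the 3 largest in A[2..6]

# Both queries return positions only, so any array with the same relative
# order gives identical answers.
B = [30, 10, 40, 10, 50, 90, 20, 60]
print(range_top_k(B, 2, 6, 3) == range_top_k(A, 2, 6, 3))
```

The last comparison shows why an encoding can discard A: the answers depend on the relative order of the entries, not on their values.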